This is the introduction paragraph. In this file we’re cleaning the data and setting up files for later processing.
Load, clean, recode and save
Load files
# Load the neighborhoods GeoJSON# creates data["Parcels"] containing geojson data.data = {}for feature in [selector] + features: geofile = get_newest_file( precious_folder, feature, ext=".geojson" )#logger.debug(geofile) logger.info(f"Found {feature}: {geofile}" ) data[feature] = gpd.read_file( geofile )# convert all feature files to same CRS mapping as Civic_Associationsfor feature in features: data[feature] = data[feature].to_crs( data[selector].crs )
INFO | __main__:<module>:7 - Found Civic_Associations: /home/john/projects/ssg-thefan-analysis/precious/Civic_Associations-2025-05-16.geojson
INFO | __main__:<module>:7 - Found Addresses: /home/john/projects/ssg-thefan-analysis/precious/Addresses-2025-09-25.geojson
INFO | __main__:<module>:7 - Found Parcels: /home/john/projects/ssg-thefan-analysis/precious/Parcels-2025-09-25.geojson
Drop columns
# columns to drop:shared_drops = ['CreatedBy','CreatedDate','EditBy','EditDate']## file specific column drop mappingsdrop_columns = {"Civic_Associations" : ['OBJECTID'] + shared_drops,"Addresses" : ['OBJECTID'] + shared_drops,"Parcels" : ['OBJECTID'],}for feature in [selector] + features: data[feature] = data[feature].drop(columns=drop_columns[feature])
Spatial join, select only parcels and addresses in the Fan.
# Pull out only FDA from Civic_Associations and store itdata[selector_key] = data[selector][ data[selector]["Name"] == selector_key ]# Select only features from the selector_key (FDA). perform spatial join.for feature in features: predicate ="overlaps"if feature=="Neighborhoods"else"within" data[feature+"_in_fan"] = gpd.sjoin(data[feature], data[selector_key], predicate=predicate, how="inner")
Drop unnecessary columns
# Drop columns created from spatial joinshared_drops = ["index_right","AdoptionDate","ChangeDate","Shape__Area","Shape__Length"]drop_columns = {"Addresses" : ['GlobalID'] + shared_drops,"Parcels" : ['MaskedOwner','GlobalID_left','GlobalID_right'] + shared_drops,}for feature in features: feature_name = feature+"_in_fan" data[feature_name] = data[feature_name].drop(columns=drop_columns[feature])
Create columns and recode columns in Parcels
gdf = data["Parcels_in_fan"]gdf["OwnerOccupied"] = gdf.apply(lambda row: str(row["MailAddress"]).startswith(str(row["AsrLocationBldgNo"]))andstr(row["MailCity"]).upper() =="RICHMOND"andstr(row["MailState"]).upper() =="VA"andstr(row["MailZip"]) =="23220", axis=1).map({True: 1, False: 0})# Rule 1: If PropertyClass contains 'Commercial'gdf.loc[gdf["PropertyClass"].str.contains("Commercial", case=False, na=False), "LandUse"] ="Commercial"# Rule 2: If PropertyClass contains 'Condo'gdf.loc[gdf["PropertyClass"].str.contains("Condo", case=False, na=False), "LandUse"] ="Multi-Family"gdf["SharedGeometry"] = gdf.duplicated(subset="geometry", keep=False).astype(int)# Create FanUsemapping = {"Single Family": "FanResidential","Multi-Family": "FanResidential","Duplex (2 Family)" : "FanResidential","Commercial": "FanBusiness","Industrial": "FanBusiness","Office" : "FanBusiness","Institutional" : "FanBusiness","Mixed-Use" : "FanMixed-Use"}gdf["FanUse"] = gdf["LandUse"].map(mapping).fillna("FanOther")# Ensure PropertyClass is string and safe for NaNsmask = gdf["PropertyClass"].fillna("").str.contains("vacant|parking|common|garage|storage|tower|space", case=False, na=False)# Apply recodegdf.loc[mask, "FanUse"] ="FanOther"# A new variable to permit easy selectinggdf["FanUseType"] ="FanIgnore"# defaultgdf.loc[(gdf["OwnerOccupied"] ==1) & (gdf["Mailable"] ==1), "FanUseType"] ="FanOwner"gdf.loc[(gdf["OwnerOccupied"] ==0) & (gdf["Mailable"] ==1), "FanUseType"] ="FanRental"# Reset if it's one of the FanOther property classes.gdf.loc[mask, "FanUseType"] ="FanIgnore"mapping = {"FanResidential": 1,"FanBusiness": 10,"FanMixed-Use" : 20,"FanOther": 99}gdf["FanUseOrder"] = gdf["FanUse"].map(mapping).fillna(99)data["Parcels_in_fan"] = gdf
Clean columns in Addresses
Save Parcels_in_fan and Addresses_in_fan for later use.
for feature in features: feature_name = feature+"_in_fan"# store to parquet using pyarror (workflow tip from chatgpt data[feature_name].to_parquet(f"{feature_name}.parquet",engine="pyarrow") logger.info(f"Saving: {feature_name}.parquet" )# create dataframe without spatial geometries and store to CSV gdf = data[feature_name].drop(columns="geometry") gdf.to_csv(f"{feature_name}.csv", index=False) logger.info(f"Saving: {feature_name}.csv" )
INFO | __main__:<module>:6 - Saving: Addresses_in_fan.parquet
INFO | __main__:<module>:11 - Saving: Addresses_in_fan.csv
INFO | __main__:<module>:6 - Saving: Parcels_in_fan.parquet
INFO | __main__:<module>:11 - Saving: Parcels_in_fan.csv
Examine Addresses
In the following sections we review the Addresses file. If changes are necessary, iterate with the cleaning sections and rerun the report until everyting is clean.
List count addresses by zipcode
Loading ITables v2.5.2 from the internet...
(need help?)
List odd addresses
This should return blank. These were cleaned/fixed above.
show_result_set("""select *from addresseswhere not ZipCode in ('23220','23284')order by StreetName,AddressLabel""",pageLength=10)
Loading ITables v2.5.2 from the internet...
(need help?)
List of Street Names
Loading ITables v2.5.2 from the internet...
(need help?)
List AddressLabels with mismatched AddressBase
Loading ITables v2.5.2 from the internet...
(need help?)
List Addresses with mismatched AddressExtension
Loading ITables v2.5.2 from the internet...
(need help?)
List addresses that don’t add up
Loading ITables v2.5.2 from the internet...
(need help?)
List of valid unit types
Loading ITables v2.5.2 from the internet...
(need help?)
List of Addresses with unittype in AddressLabel and missing ExtensionWithUnit
Loading ITables v2.5.2 from the internet...
(need help?)
View All Addresses
Loading ITables v2.5.2 from the internet...
(need help?)
Examine Parcels
FanUse by FanUseType
Loading ITables v2.5.2 from the internet...
(need help?)
FanOwner - Parcel owner address matches building address number, parcel owner zip is 23220, and parcel has mailable USPS address for owner of record. These parcel owners have their tax record mailed to this parcel address, so they’re probably the owner.
FanRental - Parcel owner address doesn’t match building address number. So, parcel owner address is OUTSIDE the Fan. It’s possible that the owner uses a different address for tax bill. NOTE - the parcel database doesn’t contain addresses for these parcels. We don’t have Fan addresses for these parcels, just the address of the tax owner. For example, the address for Joe’s Inn isn’t in parcel database. The parcel owner is outside the Fan.
FanIgnore - not a mailable address, probably a park, parking lot, common area, etc.
FanUse, LandUse by FanUseType
Loading ITables v2.5.2 from the internet...
(need help?)
FanUse, LandUse, PropertyClass by FanUseType
Loading ITables v2.5.2 from the internet...
(need help?)
View All Parcels
Use the Sort field to select subsets of parcels.
Loading ITables v2.5.2 from the internet...
(need help?)
Create merge file
Loading ITables v2.5.2 from the internet...
(need help?)
Loading ITables v2.5.2 from the internet...
(need help?)
Loading ITables v2.5.2 from the internet...
(need help?)
Loading ITables v2.5.2 from the internet...
(need help?)
Loading ITables v2.5.2 from the internet...
(need help?)
Loading ITables v2.5.2 from the internet...
(need help?)
Finally, create interim table
Data Dictionary
Perfect 👍 — here’s a full schema documentation for parcel_address_join_with_flags, structured for readability and ready to drop into a README, Quarto doc, or data dictionary.
📑 Table Schema: parcel_address_join_with_flags
Description This table links Parcels (tax assessment records) with Addresses (city-recognized mailing addresses), enriched with classification fields and quality-control flags.
One row for every Parcel ↔︎ Address combination.
Contains all relevant attributes from both source datasets plus derived fields for analysis.
Stored in Parquet (ParcelAddressJoin_in_fan.parquet) for reuse.
1. Parcel Attributes
(From the Parcels dataset)
Column
Type
Description
ParcelID
INTEGER
Unique parcel identifier in the city GIS.
PIN
VARCHAR
Parcel Identification Number (tax lot ID).
CountOfPIN
INTEGER
Number of PINs associated with this parcel (multi-PIN parcels).
OwnerName
VARCHAR
Name of the current owner of the parcel.
MailAddress
VARCHAR
Mailing street address of the owner (may differ from property location).
MailCity
VARCHAR
City portion of the owner’s mailing address.
MailState
VARCHAR
State portion of the owner’s mailing address.
MailZip
VARCHAR
ZIP code of the owner’s mailing address.
AssessmentDate
VARCHAR
Date of most recent assessment.
LandValue
DOUBLE
Assessed land value.
DwellingValue
DOUBLE
Assessed building value.
TotalValue
DOUBLE
Total assessed value.
LandSqFt
DOUBLE
Parcel land area in square feet.
ProvalAsmtNhood
VARCHAR
Assessment neighborhood code.
TaxExemptCode
VARCHAR
Tax exemption code (if applicable).
PropertyClassID
VARCHAR
Internal property classification ID.
PropertyClass
VARCHAR
Human-readable property classification (e.g., R Two Story, B Commercial Common Area Main).
LandUse
VARCHAR
Land use category (e.g., Single Family, Multi-Family, Commercial, Vacant).
OwnerOccupied
BIGINT
Flag (1/0) if the property is owner-occupied.
FanUse
VARCHAR
Local neighborhood-specific use classification.
FanUseType
VARCHAR
Derived Fan District use type (e.g., FanOwner, FanRenter).
FanUseOrder
BIGINT
Ordering value for FanUse categories.
parcel_geometry
GEOMETRY
Polygon geometry of the parcel boundary.
2. Address Attributes
(From the Addresses dataset)
Column
Type
Description
AddressId
VARCHAR
Unique address record identifier.
AddressLabel
VARCHAR
Formatted mailing address (human-readable, e.g. 401 N Robinson St Apt A).
BuildingNumber
VARCHAR
Street/building number.
StreetDirection
VARCHAR
Street direction (e.g., N, S, E, W).
StreetName
VARCHAR
Street name.
StreetType
VARCHAR
Street type (e.g., St, Ave, Blvd).
ExtensionWithUnit
VARCHAR
Address extension (e.g., Rear, Suite, Apt).
UnitType
VARCHAR
Unit type (e.g., Apt, Unit, Suite).
UnitValue
VARCHAR
Unit number or identifier.
ZipCode
VARCHAR
Postal ZIP code.
Mailable
VARCHAR
“Yes” if USPS deliverable, “No” otherwise.
StatePlaneX
DOUBLE
X coordinate (State Plane projection).
StatePlaneY
DOUBLE
Y coordinate (State Plane projection).
Latitude
DOUBLE
Latitude coordinate (WGS84).
Longitude
DOUBLE
Longitude coordinate (WGS84).
address_geometry
GEOMETRY
Point geometry of the address location.
3. Derived Fields (Flags and Counts)
Column
Type
Description
is_mailable
INTEGER
1 if Mailable = 'Yes', else 0.
address_count
INTEGER
Number of addresses linked to the parcel.
mailable_count
INTEGER
Number of mailable addresses linked to the parcel.
match_category
VARCHAR
Classification of parcel–address relationship. Possible values: - No match - Single address (mailable) - Single address (not mailable) - Multiple addresses (1 mailable) - Multiple addresses (many mailable) - Multiple addresses (none mailable) - Unclassified.
expectation_flag
VARCHAR
Quality flag comparing match_category against LandUse expectations. Possible values: - Expected - Unexpected - Review.
label_parcel_count
INTEGER
Number of distinct parcels sharing the same AddressLabel.
4. Row Semantics
Each row = one parcel–address combination.
Parcels with multiple addresses → appear multiple times.
Together, this table enables analysis at the parcel-level, address-level, or household-level.
✅ With this schema, you can:
Join members (by AddressLabel or AddressId) and check anomalies (expectation_flag).
Collapse to households (using AddressLabel, label_parcel_count).
Do penetration analysis across LandUse / PropertyClass.
👉 Do you want me to also sketch a schema diagram (like an ERD-style box with key columns and relationships between parcels, addresses, and the join) so you have a visual to drop into documentation?
How many rows share the same AddressLabel?
Loading ITables v2.5.2 from the internet...
(need help?)